Vincent Bernat: IPv6 route lookup on Linux
In a previous article, we had a look at IPv4 route lookup on Linux. Let's see how different IPv6 is.
Lookup trie implementation
Looking up a prefix in a routing table comes down to finding the most specific
entry matching the requested destination. A common structure for this task is
the trie, a tree structure where each node has its parent as prefix.
With IPv4, Linux uses a level-compressed trie (or LPC-trie), providing
good performance with low memory usage. For IPv6, Linux uses a more
classic radix tree (or Patricia trie). There are three reasons for not
sharing:
- The IPv6 implementation (introduced in Linux 2.1.8, 1996) predates the IPv4
implementation based on LPC-tries (in Linux 2.6.13, commit 19baf839ff4a).
- The feature set is different. Notably, IPv6 supports source-specific
routing1 (since Linux 2.1.120, 1998).
- The IPv4 address space is denser than the IPv6 address
space. Level-compression is therefore quite efficient with IPv4. This may not
be the case with IPv6.
The trie in the illustration below encodes 6 prefixes:
For a more in-depth explanation of the different ways to encode a routing table
into a trie, and for a better understanding of radix trees, see
the explanations for IPv4.
The following figure shows the in-memory representation of the previous radix
tree. Each node corresponds to a struct fib6_node. When a node has the
RTN_RTINFO flag set, it embeds a pointer to a struct rt6_info containing
information about the next-hop.
The fib6_lookup_1() function walks the radix tree in two steps:
- walking down the tree to locate the potential candidate, and
- checking the candidate and, if needed, backtracking until a match is found.
Here is a slightly simplified version without source-specific routing:
    static struct fib6_node *fib6_lookup_1(struct fib6_node *root,
                                           struct in6_addr *addr)
    {
        struct fib6_node *fn;
        __be32 dir;

        /* Step 1: locate potential candidate */
        fn = root;
        for (;;) {
            struct fib6_node *next;
            dir = addr_bit_set(addr, fn->fn_bit);
            next = dir ? fn->right : fn->left;
            if (next) {
                fn = next;
                continue;
            }
            break;
        }

        /* Step 2: check prefix and backtrack if needed */
        while (fn) {
            if (fn->fn_flags & RTN_RTINFO) {
                struct rt6key *key = &fn->leaf->rt6i_dst;
                if (ipv6_prefix_equal(&key->addr, addr, key->plen))
                    return fn;
            }
            if (fn->fn_flags & RTN_ROOT)
                break;
            fn = fn->parent;
        }

        return NULL;
    }
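The same two-step logic (walk down, then check the prefix and backtrack) can be exercised in userspace on a toy binary trie over 8-bit addresses. This is a standalone analogue for illustration only; the names (node, bit_set, lookup) are made up and it is not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy radix-tree node over 8-bit "addresses". bit is the index of the
 * bit tested at this node (0 = most significant); has_info marks nodes
 * carrying routing information (the RTN_RTINFO analogue). */
struct node {
    struct node *left, *right, *parent;
    int bit;
    bool has_info;
    uint8_t prefix;  /* prefix value, high bits significant */
    int plen;        /* prefix length in bits */
};

static bool bit_set(uint8_t addr, int bit)
{
    return (addr >> (7 - bit)) & 1;
}

static bool prefix_equal(uint8_t a, uint8_t b, int plen)
{
    uint8_t mask = plen ? (uint8_t)(0xff << (8 - plen)) : 0;
    return (a & mask) == (b & mask);
}

/* Step 1: walk down to a leaf; step 2: backtrack through parents
 * until a node whose prefix actually covers the address is found. */
static struct node *lookup(struct node *root, uint8_t addr)
{
    struct node *fn = root;
    for (;;) {
        struct node *next = bit_set(addr, fn->bit) ? fn->right : fn->left;
        if (!next)
            break;
        fn = next;
    }
    while (fn) {
        if (fn->has_info && prefix_equal(fn->prefix, addr, fn->plen))
            return fn;
        fn = fn->parent;
    }
    return NULL;
}
```

Backtracking kicks in when the leaf reached in step 1 carries a prefix that does not actually cover the destination: the walk then climbs back toward the root until a covering prefix (possibly the default route) is found.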
Caching
While IPv4 lost its route cache in Linux 3.6 (commit 5e9965c15ba8), IPv6
still has a caching mechanism. However, cache entries are put directly in the
radix tree instead of in a distinct structure.
Since Linux 2.1.30 (1997) and until Linux 4.2 (commit 45e4fd26683c), almost
any successful route lookup inserts a cache entry in the radix tree. For
example, a router forwarding a ping between 2001:db8:1::1 and 2001:db8:3::1
would get those two cache entries:
$ ip -6 route show cache
2001:db8:1::1 dev r2-r1 metric 0
cache
2001:db8:3::1 via 2001:db8:2::2 dev r2-r3 metric 0
cache
These entries are cleaned up by the ip6_dst_gc() function, controlled by the
following parameters:
$ sysctl -a | grep -F net.ipv6.route
net.ipv6.route.gc_elasticity = 9
net.ipv6.route.gc_interval = 30
net.ipv6.route.gc_min_interval = 0
net.ipv6.route.gc_min_interval_ms = 500
net.ipv6.route.gc_thresh = 1024
net.ipv6.route.gc_timeout = 60
net.ipv6.route.max_size = 4096
net.ipv6.route.mtu_expires = 600
The garbage collector is triggered at most every 500 ms when there are more than
1024 entries, or at least every 30 seconds. It won't run for more than 60
seconds, unless there are more than 4096 routes. When running, it will first
delete entries older than 30 seconds. If the number of cache entries is still
greater than 4096, it will continue to delete more recent entries (but no more
recent than 512 jiffies, a bound derived from gc_elasticity) after a 500 ms
pause.
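The trigger conditions above can be sketched as a small predicate. This is an illustrative model of the described rules with hypothetical names (gc_should_run), not the kernel's actual implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Default sysctl values quoted above, converted to milliseconds. */
#define GC_MIN_INTERVAL_MS 500   /* net.ipv6.route.gc_min_interval_ms */
#define GC_INTERVAL_MS     30000 /* net.ipv6.route.gc_interval (30 s) */
#define GC_THRESH          1024  /* net.ipv6.route.gc_thresh */

/* Simplified model: the collector fires at least every gc_interval,
 * and at most every gc_min_interval_ms when the cache exceeds gc_thresh. */
static bool gc_should_run(long ms_since_last_gc, int cache_entries)
{
    if (ms_since_last_gc >= GC_INTERVAL_MS)
        return true;
    return cache_entries > GC_THRESH &&
           ms_since_last_gc >= GC_MIN_INTERVAL_MS;
}
```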
Starting from Linux 4.2 (commit 45e4fd26683c), only a PMTU exception creates a
cache entry. A router doesn't have to handle those exceptions, so only hosts
get cache entries. And they should be pretty rare. Martin KaFai Lau explains:
Out of all IPv6 RTF_CACHE routes that are created, the percentage that has a
different MTU is very small. In one of our end-user facing proxy server, only
1k out of 80k RTF_CACHE routes have a smaller MTU. For our DC traffic, there
is no MTU exception.
Here is how a cache entry with a PMTU exception looks:
$ ip -6 route show cache
2001:db8:1::50 via 2001:db8:1::13 dev out6 metric 0
cache expires 573sec mtu 1400 pref medium
Performance
We consider three distinct scenarios:
- Excerpt of an Internet full view
- In this scenario, Linux acts as an edge router attached to the default-free
zone. Currently, the size of such a routing table is a little bit
above 40,000 routes.
- /48 prefixes spread linearly with different densities
- Linux acts as a core router inside a datacenter. Each customer or rack gets
one or several /48 networks, which need to be routed around. With a density of
1, /48 subnets are contiguous.
- /128 prefixes spread randomly in a fixed /108 subnet
- Linux acts as a leaf router for a /64 subnet, with hosts getting their IP
through autoconfiguration. It is assumed all hosts share the same OUI, so the
first 40 bits are fixed. In this scenario, neighbor reachability information
for the /64 subnet is converted into routes by some external process and
redistributed among other routers sharing the same subnet2.
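For the /48 scenario, "density" can be read as the fraction of consecutive /48 slots that are actually used. This is my own reading, not necessarily the benchmark's exact generator; a minimal sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Index of the /48 slot used by the i-th inserted prefix when prefixes
 * are spread linearly with the given density. With density 1.0 the
 * subnets are contiguous; with density 0.5 every other slot stays empty.
 * Illustration only (my interpretation of the scenario). */
static uint64_t nth_slot(uint64_t i, double density)
{
    return (uint64_t)(i / density);
}
```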
Route lookup performance
With the help of a small kernel module, we can accurately benchmark3 the
ip6_route_output_flags() function and correlate the results with the radix
tree size:
Getting meaningful results is challenging due to the size of the address
space. None of the scenarios have a fallback route and we only measure time for
successful hits4. For the full view scenario, only the range from 2400::/16
to 2a06::/16 is scanned (it contains more than half of the routes). For the
/128 scenario, the whole /108 subnet is scanned. For the /48
scenario, the range from the first /48 to the last one is scanned. For each
range, 5000 addresses are picked semi-randomly. This operation is repeated until
we get 5000 hits or until 1 million tests have been executed.
The relation between the maximum depth and the lookup time is incomplete, and I
can't explain the performance difference between the various densities of
the /48 scenario.
We can extract two important performance points:
- With a full view, the lookup time is 450 ns. This is almost ten times
the budget for forwarding at 10 Gbps, which is about 50 ns.
- With an almost empty routing table, the lookup time is 150 ns. This
is still over the time budget for forwarding at 10 Gbps.
With IPv4, the lookup time for an almost empty table was 20 ns while the lookup
time for a full view (500,000 routes) was a bit above 50 ns. How to explain such
a difference? First, the maximum depth of the IPv4 LPC-trie with 500,000 routes
was 6, while the maximum depth of the IPv6 radix tree for 40,000 routes is 40.
Second, while both IPv4's fib_lookup() and IPv6's ip6_route_output_flags()
functions have a
fixed cost implied by the evaluation of routing rules, IPv4 has several
optimizations when the rules are left unmodified5. Those optimizations
are removed on the first modification. If we cancel those optimizations, the
lookup time for IPv4 is impacted by about 30 ns. This still leaves a 100 ns
difference with IPv6 to be explained.
Let's compare how time is spent in each lookup function. Here is a CPU
flamegraph for IPv4's fib_lookup():
Only 50% of the time is spent in the actual route lookup. The remaining time is
spent evaluating the routing rules (about 30 ns). This ratio is dependent on the
number of routes we inserted (only 1000 in this example). It should be noted
that the fib_table_lookup() function is executed twice: once with the local
routing table and once with the main routing table.
The equivalent flamegraph for IPv6's ip6_route_output_flags() is depicted
below:
Here is an approximate breakdown of the time spent:
- 50% is spent in the route lookup in the main table,
- 15% is spent in handling locking (IPv4 is using the more efficient RCU mechanism),
- 5% is spent in the route lookup of the local table,
- most of the remaining is spent in routing rule evaluation (about 100 ns)6.
Why is the evaluation of routing rules less efficient with IPv6? Again, I
don't have a definitive answer.
History
The following graph shows the performance progression of route lookups through
Linux history:
All kernels are compiled with GCC 4.9 (from Debian Jessie). This version is
able to compile older kernels as well as current ones. The kernel configuration
is the default one with CONFIG_SMP, CONFIG_IPV6, CONFIG_IPV6_MULTIPLE_TABLES,
and CONFIG_IPV6_SUBTREES options enabled. Some
other unrelated options are enabled to be able to boot them in a virtual machine
and run the benchmark.
There are three notable performance changes:
- In Linux 3.1, Eric Dumazet slightly delays the copy of route metrics to fix
the undesirable sharing of route-specific metrics by all cache entries
(commit 21efcfa0ff27). Each cache entry now gets its own metrics, which
explains the performance hit for the non-/128 scenarios.
- In Linux 3.9, Yoshifuji Hideaki removes the reference to the neighbor
entry in struct rt6_info (commit 887c95cc1da5). This should have led to
a performance increase. The small regression may be due to cache-related
issues.
- In Linux 4.2, Martin KaFai Lau prevents the creation of cache entries for
most route lookups. The most significant performance improvement comes
with commit 4b32b5ad31a6. The second one comes from commit 45e4fd26683c,
which effectively removes the creation of cache entries, except for PMTU
exceptions.
Insertion performance
Another interesting performance-related metric is the insertion time. Linux is
able to insert a full view in less than two seconds. For some reason, the
insertion time is not linear above 50,000 routes and climbs very fast to 60
seconds for 500,000 routes.
Despite its more complex insertion logic, the IPv4 subsystem is able to insert 2
million routes in less than 10 seconds.
Memory usage
Radix tree nodes (struct fib6_node) and routing information (struct rt6_info)
are allocated with the slab allocator7. It is therefore possible to extract
the information from /proc/slabinfo when the kernel is booted with the
slab_nomerge flag:
# sed -ne 2p -e '/^ip6_dst/p' -e '/^fib6_nodes/p' /proc/slabinfo | cut -f1 -d:
name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
fib6_nodes 76101 76104 64 63 1
ip6_dst_cache 40090 40090 384 10 1
In the above example, the used memory is 76104×64 + 40090×384 bytes (about 20
MiB). The number of struct rt6_info matches the number of routes, while the
number of nodes is roughly twice the number of routes:
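The memory figure can be checked with the numbers from the slabinfo output above (the helper name slab_bytes is mine):

```c
#include <assert.h>

/* Slab usage is simply object count times object size, per cache. */
static long slab_bytes(long num_objs, long objsize)
{
    return num_objs * objsize;
}
```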
The memory usage is therefore quite predictable and reasonable, as even a small
single-board computer can support several full views (20 MiB for each):
The LPC-trie used for IPv4 is more efficient: where 512 MiB of memory is
needed for IPv6 to store 1 million routes, only 128 MiB are needed for IPv4.
The difference is mainly due to the size of struct rt6_info (336 bytes)
compared to the size of IPv4's struct fib_alias (48 bytes): IPv4 puts most
information about next-hops in struct fib_info structures that are shared
among many entries.
Conclusion
The takeaways from this article are:
- upgrade to Linux 4.2 or more recent to avoid excessive caching,
- route lookups are noticeably slower than with IPv4 (by an order of magnitude),
- the CONFIG_IPV6_MULTIPLE_TABLES option incurs a fixed penalty of 100 ns per lookup,
- memory usage is fair (20 MiB for 40,000 routes).
Compared to IPv4, IPv6 in Linux doesn't attract the same interest, notably in
terms of optimizations. Hopefully, things are changing as its adoption and use
at scale are increasing.
-
For a given destination prefix, it's possible to attach source-specific
prefixes:
ip -6 route add 2001:db8:1::/64 \
from 2001:db8:3::/64 \
via fe80::1 \
dev eth0
Lookup is first done on the destination address, then on the source address.
-
This is quite different from the classic scenario where Linux acts as a
gateway for a /64 subnet. In this case, the neighbor subsystem stores the
reachability information for each host and the routing table only contains a
single /64 prefix.
-
The measurements are done in a virtual machine with one vCPU and no
neighbors. The host is an Intel Core i5-4670K running at 3.7 GHz during
the experiment (CPU governor set to performance). The benchmark is
single-threaded. Many lookups are performed and the result reported is the
median value. Timings of individual runs are computed from the TSC.
-
Most of the packets in the network are expected to be routed to a
destination. However, this also means the backtracking code path is not used
in the /128 and /48 scenarios. Having a fallback route gives far different
results and makes it difficult to ensure we explore the address space
correctly.
-
The exact same optimizations could be applied to IPv6. Nobody has done
it yet.
-
Compiling out table support effectively removes those last
100 ns.
-
There are also per-CPU pointers allocated directly (4 bytes per entry
per CPU on a 64-bit architecture). We ignore this detail.
static struct fib6_node *fib6_lookup_1(struct fib6_node *root, struct in6_addr *addr) struct fib6_node *fn; __be32 dir; /* Step 1: locate potential candidate */ fn = root; for (;;) struct fib6_node *next; dir = addr_bit_set(addr, fn->fn_bit); next = dir ? fn->right : fn->left; if (next) fn = next; continue; break; /* Step 2: check prefix and backtrack if needed */ while (fn) if (fn->fn_flags & RTN_RTINFO) struct rt6key *key; key = fn->leaf->rt6i_dst; if (ipv6_prefix_equal(&key->addr, addr, key->plen)) if (fn->fn_flags & RTN_RTINFO) return fn; if (fn->fn_flags & RTN_ROOT) break; fn = fn->parent; return NULL;
2001:db8:1::1
and 2001:db8:3::1
would get those two cache entries:
$ ip -6 route show cache 2001:db8:1::1 dev r2-r1 metric 0 cache 2001:db8:3::1 via 2001:db8:2::2 dev r2-r3 metric 0 cache
ip6_dst_gc()
function
controlled by the following parameters:
$ sysctl -a grep -F net.ipv6.route net.ipv6.route.gc_elasticity = 9 net.ipv6.route.gc_interval = 30 net.ipv6.route.gc_min_interval = 0 net.ipv6.route.gc_min_interval_ms = 500 net.ipv6.route.gc_thresh = 1024 net.ipv6.route.gc_timeout = 60 net.ipv6.route.max_size = 4096 net.ipv6.route.mtu_expires = 600
gc_elasticity
) after a 500 ms pause.
Starting from Linux 4.2 (commit 45e4fd26683c), only a PMTU exception would
create a cache entry. A router doesn t have to handle those exceptions, so only
hosts would get cache entries. And they should be pretty rare. Martin KaFai
Lau explains:
Out of all IPv6Here is how a cache entry with a PMTU exception looks like:RTF_CACHE
routes that are created, the percentage that has a different MTU is very small. In one of our end-user facing proxy server, only 1k out of 80kRTF_CACHE
routes have a smaller MTU. For our DC traffic, there is no MTU exception.
$ ip -6 route show cache 2001:db8:1::50 via 2001:db8:1::13 dev out6 metric 0 cache expires 573sec mtu 1400 pref medium
Performance
We consider three distinct scenarios:
- Excerpt of an Internet full view
- In this scenario, Linux acts as an edge router attached to the default-free
zone. Currently, the size of such a routing table is a little bit
above 40,000 routes.
- /48 prefixes spread linearly with different densities
- Linux acts as a core router inside a datacenter. Each customer or rack gets
one or several /48 networks, which need to be routed around. With a density of
1, /48 subnets are contiguous.
- /128 prefixes spread randomly in a fixed /108 subnet
- Linux acts as a leaf router for a /64 subnet with hosts getting their IP using
autoconfiguration. It is assumed all hosts share the same OUI and therefore,
the first 40 bits are fixed. In this scenario, neighbor reachability
information for the /64 subnet are converted into routes by some external
process and redistributed among other routers sharing the same subnet2.
Route lookup performance
With the help of a small kernel module, we can accurately benchmark3
the ip6_route_output_flags()
function and
correlate the results with the radix tree size:
Getting meaningful results is challenging due to the size of the address
space. None of the scenarios have a fallback route and we only measure time for
successful hits4. For the full view scenario, only the range from
2400::/16
to 2a06::/16
is scanned (it contains more than half of the
routes). For the /128 scenario, the whole /108 subnet is scanned. For the /48
scenario, the range from the first /48 to the last one is scanned. For each
range, 5000 addresses are picked semi-randomly. This operation is repeated until
we get 5000 hits or until 1 million tests have been executed.
The relation between the maximum depth and the lookup time is incomplete and I
can t explain the difference of performance between the different densities of
the /48 scenario.
We can extract two important performance points:
- With a full view, the lookup time is 450 ns. This is almost ten times
the budget for forwarding at 10 Gbps which is about 50 ns.
- With an almost empty routing table, the lookup time is 150 ns. This
is still over the time budget for forwarding at 10 Gbps.
With IPv4, the lookup time for an almost empty table was 20 ns while the lookup
time for a full view (500,000 routes) was a bit above 50 ns. How to explain such
a difference? First, the maximum depth of the IPv4 LPC-trie with 500,000 routes
was 6, while the maximum depth of the IPv6 radix tree for 40,000 routes is 40.
Second, while both IPv4 s fib_lookup()
and
IPv6 s ip6_route_output_flags()
functions have a
fixed cost implied by the evaluation of routing rules, IPv4 has several
optimizations when the rules are left unmodified5. Those optimizations
are removed on the first modification. If we cancel those optimizations, the
lookup time for IPv4 is impacted by about 30 ns. This still leaves a 100 ns
difference with IPv6 to be explained.
Let s compare how time is spent in each lookup function. Here is
a CPU flamegraph for IPv4 s fib_lookup()
:
Only 50% of the time is spent in the actual route lookup. The remaining time is
spent evaluating the routing rules (about 30 ns). This ratio is dependent on the
number of routes we inserted (only 1000 in this example). It should be noted the
fib_table_lookup()
function is executed twice: once with the local routing
table and once with the main routing table.
The equivalent flamegraph for
IPv6 s ip6_route_output_flags()
is depicted below:
Here is an approximate breakdown on the time spent:
- 50% is spent in the route lookup in the main table,
- 15% is spent in handling locking (IPv4 is using the more efficient RCU mechanism),
- 5% is spent in the route lookup of the local table,
- most of the remaining is spent in routing rule evaluation (about 100 ns)6.
Why does the evaluation of routing rules is less efficient with IPv6? Again, I
don t have a definitive answer.
History
The following graph shows the performance progression of route lookups through
Linux history:
All kernels are compiled with GCC 4.9 (from Debian Jessie). This version is
able to compile older kernels as well as current ones. The kernel configuration
is the default one with CONFIG_SMP
, CONFIG_IPV6
,
CONFIG_IPV6_MULTIPLE_TABLES
and CONFIG_IPV6_SUBTREES
options enabled. Some
other unrelated options are enabled to be able to boot them in a virtual machine
and run the benchmark.
There are three notable performance changes:
- In Linux 3.1, Eric Dumazet delays a bit the copy of route metrics to fix
the undesirable sharing of route-specific metrics by all cache entries
(commit 21efcfa0ff27). Each cache entry now gets its own metrics, which
explains the performance hit for the non-/128 scenarios.
- In Linux 3.9, Yoshifuji Hideaki removes the reference to the neighbor
entry in
struct rt6_info
(commit 887c95cc1da5). This should have lead to
a performance increase. The small regression may be due to cache-related
issues.
- In Linux 4.2, Martin KaFai Lau prevents the creation of cache entries for
most route lookups. The most sensible performance improvement comes
with commit 4b32b5ad31a6. The second one is from commit 45e4fd26683c,
which effectively removes creation of cache entries, except for PMTU
exceptions.
Insertion performance
Another interesting performance-related metric is the insertion time. Linux is
able to insert a full view in less than two seconds. For some reason, the
insertion time is not linear above 50,000 routes and climbs very fast to 60
seconds for 500,000 routes.
Despite its more complex insertion logic, the IPv4 subsystem is able to insert 2
million routes in less than 10 seconds.
Memory usage
Radix tree nodes (struct fib6_node
) and routing information (struct
rt6_info
) are allocated with the slab allocator7. It is therefore
possible to extract the information from /proc/slabinfo
when the kernel is
booted with the slab_nomerge
flag:
# sed -ne 2p -e '/^ip6_dst/p' -e '/^fib6_nodes/p' /proc/slabinfo cut -f1 -d:
name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
fib6_nodes 76101 76104 64 63 1
ip6_dst_cache 40090 40090 384 10 1
In the above example, the used memory is 76104 64+40090 384 bytes (about 20
MiB). The number of struct rt6_info
matches the number of routes while the
number of nodes is roughly twice the number of routes:
The memory usage is therefore quite predictable and reasonable, as even a small
single-board computer can support several full views (20 MiB for each):
The LPC-trie used for IPv4 is more efficient: when 512 MiB of memory is needed
for IPv6 to store 1 million routes, only 128 MiB are needed for IPv4. The
difference is mainly due to the size of struct rt6_info
(336 bytes) compared
to the size of IPv4 s struct fib_alias
(48 bytes): IPv4 puts most information
about next-hops in struct fib_info
structures that are shared with many
entries.
Conclusion
The takeaways from this article are:
- upgrade to Linux 4.2 or more recent to avoid excessive caching,
- route lookups are noticeably slower compared to IPv4 (by an order of magnitude),
CONFIG_IPV6_MULTIPLE_TABLES
option incurs a fixed penalty of 100 ns by lookup,
- memory usage is fair (20 MiB for 40,000 routes).
Compared to IPv4, IPv6 in Linux doesn t foster the same interest, notably in
term of optimizations. Hopefully, things are changing as its adoption and use
at scale are increasing.
-
For a given destination prefix, it s possible to attach source-specific
prefixes:
ip -6 route add 2001:db8:1::/64 \
from 2001:db8:3::/64 \
via fe80::1 \
dev eth0
Lookup is first done on the destination address, then on the source address.
-
This is quite different of the classic scenario where Linux acts as a
gateway for a /64 subnet. In this case, the neighbor subsystem stores the
reachability information for each host and the routing table only contains a
single /64 prefix.
-
The measurements are done in a virtual machine with one vCPU and no
neighbors. The host is an Intel Core i5-4670K running at 3.7 GHz during
the experiment (CPU governor set to performance). The benchmark is
single-threaded. Many lookups are performed and the result reported is the
median value. Timings of individual runs are computed from the TSC.
-
Most of the packets in the network are expected to be routed to a
destination. However, this also means the backtracking code path is not used
in the /128 and /48 scenarios. Having a fallback route gives far different
results and make it difficult to ensure we explore the address space
correctly.
-
The exact same optimizations could be applied for IPv6. Nobody did
it yet.
-
Compiling out table support effectively removes those last
100 ns.
-
There is also per-CPU pointers allocated directly (4 bytes per entry
per CPU on a 64-bit architecture). We ignore this detail.
ip6_route_output_flags()
function and
correlate the results with the radix tree size:
Getting meaningful results is challenging due to the size of the address
space. None of the scenarios have a fallback route and we only measure time for
successful hits4. For the full view scenario, only the range from
2400::/16
to 2a06::/16
is scanned (it contains more than half of the
routes). For the /128 scenario, the whole /108 subnet is scanned. For the /48
scenario, the range from the first /48 to the last one is scanned. For each
range, 5000 addresses are picked semi-randomly. This operation is repeated until
we get 5000 hits or until 1 million tests have been executed.
The relation between the maximum depth and the lookup time is incomplete and I
can t explain the difference of performance between the different densities of
the /48 scenario.
We can extract two important performance points:
- With a full view, the lookup time is 450 ns. This is almost ten times the budget for forwarding at 10 Gbps which is about 50 ns.
- With an almost empty routing table, the lookup time is 150 ns. This is still over the time budget for forwarding at 10 Gbps.
fib_lookup()
and
IPv6 s ip6_route_output_flags()
functions have a
fixed cost implied by the evaluation of routing rules, IPv4 has several
optimizations when the rules are left unmodified5. Those optimizations
are removed on the first modification. If we cancel those optimizations, the
lookup time for IPv4 is impacted by about 30 ns. This still leaves a 100 ns
difference with IPv6 to be explained.
Let s compare how time is spent in each lookup function. Here is
a CPU flamegraph for IPv4 s fib_lookup()
:
Only 50% of the time is spent in the actual route lookup. The remaining time is
spent evaluating the routing rules (about 30 ns). This ratio is dependent on the
number of routes we inserted (only 1000 in this example). It should be noted the
fib_table_lookup()
function is executed twice: once with the local routing
table and once with the main routing table.
The equivalent flamegraph for
IPv6 s ip6_route_output_flags()
is depicted below:
Here is an approximate breakdown on the time spent:
- 50% is spent in the route lookup in the main table,
- 15% is spent in handling locking (IPv4 is using the more efficient RCU mechanism),
- 5% is spent in the route lookup of the local table,
- most of the remaining is spent in routing rule evaluation (about 100 ns)6.
History
The following graph shows the performance progression of route lookups through
Linux history:
All kernels are compiled with GCC 4.9 (from Debian Jessie). This version is
able to compile older kernels as well as current ones. The kernel configuration
is the default one with CONFIG_SMP
, CONFIG_IPV6
,
CONFIG_IPV6_MULTIPLE_TABLES
and CONFIG_IPV6_SUBTREES
options enabled. Some
other unrelated options are enabled to be able to boot them in a virtual machine
and run the benchmark.
There are three notable performance changes:
- In Linux 3.1, Eric Dumazet delays a bit the copy of route metrics to fix
the undesirable sharing of route-specific metrics by all cache entries
(commit 21efcfa0ff27). Each cache entry now gets its own metrics, which
explains the performance hit for the non-/128 scenarios.
- In Linux 3.9, Yoshifuji Hideaki removes the reference to the neighbor
entry in
struct rt6_info
(commit 887c95cc1da5). This should have lead to
a performance increase. The small regression may be due to cache-related
issues.
- In Linux 4.2, Martin KaFai Lau prevents the creation of cache entries for
most route lookups. The most sensible performance improvement comes
with commit 4b32b5ad31a6. The second one is from commit 45e4fd26683c,
which effectively removes creation of cache entries, except for PMTU
exceptions.
Insertion performance
Another interesting performance-related metric is the insertion time. Linux is
able to insert a full view in less than two seconds. For some reason, the
insertion time is not linear above 50,000 routes and climbs very fast to 60
seconds for 500,000 routes.
Despite its more complex insertion logic, the IPv4 subsystem is able to insert 2
million routes in less than 10 seconds.
Memory usage
Radix tree nodes (struct fib6_node
) and routing information (struct
rt6_info
) are allocated with the slab allocator7. It is therefore
possible to extract the information from /proc/slabinfo
when the kernel is
booted with the slab_nomerge
flag:
# sed -ne 2p -e '/^ip6_dst/p' -e '/^fib6_nodes/p' /proc/slabinfo cut -f1 -d:
name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
fib6_nodes 76101 76104 64 63 1
ip6_dst_cache 40090 40090 384 10 1
In the above example, the used memory is 76104 64+40090 384 bytes (about 20
MiB). The number of struct rt6_info
matches the number of routes while the
number of nodes is roughly twice the number of routes:
The memory usage is therefore quite predictable and reasonable, as even a small
single-board computer can support several full views (20 MiB for each):
The LPC-trie used for IPv4 is more efficient: when 512 MiB of memory is needed
for IPv6 to store 1 million routes, only 128 MiB are needed for IPv4. The
difference is mainly due to the size of struct rt6_info
(336 bytes) compared
to the size of IPv4 s struct fib_alias
(48 bytes): IPv4 puts most information
about next-hops in struct fib_info
structures that are shared with many
entries.
Conclusion
The takeaways from this article are:
- upgrade to Linux 4.2 or more recent to avoid excessive caching,
- route lookups are noticeably slower compared to IPv4 (by an order of magnitude),
CONFIG_IPV6_MULTIPLE_TABLES
option incurs a fixed penalty of 100 ns by lookup,
- memory usage is fair (20 MiB for 40,000 routes).
Compared to IPv4, IPv6 in Linux doesn t foster the same interest, notably in
term of optimizations. Hopefully, things are changing as its adoption and use
at scale are increasing.
-
For a given destination prefix, it s possible to attach source-specific
prefixes:
ip -6 route add 2001:db8:1::/64 \
from 2001:db8:3::/64 \
via fe80::1 \
dev eth0
Lookup is first done on the destination address, then on the source address.
-
This is quite different of the classic scenario where Linux acts as a
gateway for a /64 subnet. In this case, the neighbor subsystem stores the
reachability information for each host and the routing table only contains a
single /64 prefix.
-
The measurements are done in a virtual machine with one vCPU and no
neighbors. The host is an Intel Core i5-4670K running at 3.7 GHz during
the experiment (CPU governor set to performance). The benchmark is
single-threaded. Many lookups are performed and the result reported is the
median value. Timings of individual runs are computed from the TSC.
-
Most of the packets in the network are expected to be routed to a
destination. However, this also means the backtracking code path is not used
in the /128 and /48 scenarios. Having a fallback route gives far different
results and make it difficult to ensure we explore the address space
correctly.
-
The exact same optimizations could be applied for IPv6. Nobody did
it yet.
-
Compiling out table support effectively removes those last
100 ns.
-
There is also per-CPU pointers allocated directly (4 bytes per entry
per CPU on a 64-bit architecture). We ignore this detail.
struct rt6_info
(commit 887c95cc1da5). This should have lead to
a performance increase. The small regression may be due to cache-related
issues.Memory usage
Radix tree nodes (struct fib6_node
) and routing information (struct
rt6_info
) are allocated with the slab allocator7. It is therefore
possible to extract the information from /proc/slabinfo
when the kernel is
booted with the slab_nomerge
flag:
# sed -ne 2p -e '/^ip6_dst/p' -e '/^fib6_nodes/p' /proc/slabinfo cut -f1 -d:
name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
fib6_nodes 76101 76104 64 63 1
ip6_dst_cache 40090 40090 384 10 1
In the above example, the used memory is 76104×64 + 40090×384 bytes (about 20
MiB). The number of struct rt6_info matches the number of routes, while the
number of nodes is roughly twice the number of routes.
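The arithmetic can be checked directly against the `<num_objs>` and `<objsize>` columns of the slabinfo output above (the numbers are the ones captured in this example):

```shell
# Multiply <num_objs> by <objsize> for both caches and convert to MiB.
awk 'BEGIN {
    fib6_nodes = 76104 * 64     # struct fib6_node objects
    ip6_dst    = 40090 * 384    # struct rt6_info objects
    printf "%.1f MiB\n", (fib6_nodes + ip6_dst) / (1024 * 1024)
}'
# prints "19.3 MiB"
```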
The memory usage is therefore quite predictable and reasonable: even a small
single-board computer can support several full views (20 MiB for each).
The LPC-trie used for IPv4 is more efficient: when 512 MiB of memory is needed
for IPv6 to store 1 million routes, only 128 MiB are needed for IPv4. The
difference is mainly due to the size of struct rt6_info (336 bytes) compared
to the size of IPv4's struct fib_alias (48 bytes): IPv4 puts most information
about next-hops in struct fib_info structures that are shared by many
entries.
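From these structure sizes, one can sketch a rough lower bound on memory usage (a sketch only: it ignores slab rounding, which pads struct rt6_info to 384-byte objects, and the per-CPU pointers mentioned in the footnotes):

```python
def ipv6_fib_memory(routes: int) -> int:
    """Rough lower bound, in bytes: one struct rt6_info (336 bytes)
    plus about two struct fib6_node (64 bytes each) per route."""
    return routes * 336 + 2 * routes * 64

# A 40,000-route full view comes out around 17.7 MiB, consistent with
# the ~20 MiB measured above once slab rounding is accounted for.
print(ipv6_fib_memory(40_000) / 2**20)

# For 1 million routes, the lower bound is about 443 MiB, in line with
# the ~512 MiB actually observed.
print(ipv6_fib_memory(1_000_000) / 2**20)
```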
Conclusion
The takeaways from this article are:
- upgrade to Linux 4.2 or more recent to avoid excessive caching,
- route lookups are noticeably slower compared to IPv4 (by an order of magnitude),
- the CONFIG_IPV6_MULTIPLE_TABLES option incurs a fixed penalty of 100 ns per lookup,
- memory usage is fair (20 MiB for 40,000 routes).
Compared to IPv4, IPv6 in Linux doesn't foster the same interest, notably in
terms of optimizations. Hopefully, things are changing as its adoption and use
at scale are increasing.
-
For a given destination prefix, it's possible to attach source-specific
prefixes:
ip -6 route add 2001:db8:1::/64 \
from 2001:db8:3::/64 \
via fe80::1 \
dev eth0
Lookup is first done on the destination address, then on the source address.
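To observe this two-step lookup, one can ask the kernel which route it would pick. A hedged illustration reusing the addresses above (the exact output depends on the actual configuration, and the second route here is an example addition, not part of the original setup):

```shell
# Example setup: add a destination-only route alongside the
# source-specific one shown above.
ip -6 route add 2001:db8:1::/64 via fe80::2 dev eth0

# With a matching source, the source-specific entry (via fe80::1)
# should be selected; without one, the destination-only route applies.
ip -6 route get 2001:db8:1::1 from 2001:db8:3::1
ip -6 route get 2001:db8:1::1
```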
-
This is quite different from the classic scenario where Linux acts as a
gateway for a /64 subnet. In this case, the neighbor subsystem stores the
reachability information for each host and the routing table only contains a
single /64 prefix.
-
The measurements are done in a virtual machine with one vCPU and no
neighbors. The host is an Intel Core i5-4670K running at 3.7 GHz during
the experiment (CPU governor set to performance). The benchmark is
single-threaded. Many lookups are performed and the result reported is the
median value. Timings of individual runs are computed from the TSC.
-
Most of the packets in the network are expected to be routed to a
destination. However, this also means the backtracking code path is not used
in the /128 and /48 scenarios. Having a fallback route gives far different
results and makes it difficult to ensure we explore the address space
correctly.
-
The exact same optimizations could be applied for IPv6. Nobody has done
it yet.
-
Compiling out table support effectively removes those last
100 ns.
-
There are also per-CPU pointers allocated directly (4 bytes per entry
per CPU on a 64-bit architecture). We ignore this detail.